
Conversation

desertfire (Contributor) commented Oct 10, 2025

Stack from ghstack (oldest at bottom):

Summary: When quantizing a model with 4w_hqq (huggingface/optimum-executorch#164), AOTI-generated code will call aoti_torch_cuda__weight_int4pack_mm as a fallback op. This PR borrows the CUDA implementation of _weight_int4pack_mm_cuda from libtorch, replacing at::Tensor and the relevant utility functions with their ET equivalents.
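
At a high level, the new shim follows the existing AOTI fallback pattern: validate the arguments, then delegate to the kernel ported from libtorch. A rough sketch of that shape is below, for illustration only; the exact parameter list and the helper name are assumptions based on the snippets quoted in the review, not the merged code.

```
// Sketch only -- the parameter list and the helper name below are assumptions.
// Tensor, AOTITorchError, ET_CHECK_OR_RETURN_ERROR, and InvalidArgument are
// the ExecuTorch equivalents of the corresponding libtorch types/macros.

// Hypothetical wrapper around the kernel ported from libtorch's
// _weight_int4pack_mm_cuda, rewritten against ExecuTorch tensors.
Tensor* weight_int4pack_mm_cuda_impl(
    const Tensor& self,
    const Tensor& mat2,
    int64_t qGroupSize,
    const Tensor& qScaleAndZeros);

AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
    Tensor* self,            // bf16 activations
    Tensor* mat2,            // int4 weights packed into int32 values
    int64_t qGroupSize,      // quantization group size (32/64/128/256)
    Tensor* qScaleAndZeros,  // per-group scales and zero points
    Tensor** ret0) {         // output handed back to AOTI-generated code
  ET_CHECK_OR_RETURN_ERROR(
      ret0 != nullptr,
      InvalidArgument,
      "aoti_torch_cuda__weight_int4pack_mm failed: ret0 is null");
  ET_CHECK_OR_RETURN_ERROR(
      qGroupSize == 32 || qGroupSize == 64 || qGroupSize == 128 ||
          qGroupSize == 256,
      InvalidArgument,
      "aoti_torch_cuda__weight_int4pack_mm: qGroupSize must be 32/64/128/256, got %lld",
      static_cast<long long>(qGroupSize));
  // Delegate to the ported kernel and return the result to the caller.
  *ret0 = weight_int4pack_mm_cuda_impl(*self, *mat2, qGroupSize, *qScaleAndZeros);
  return Error::Ok;  // assumes AOTITorchError aliases executorch::runtime::Error
}
```

The heavy lifting stays in the ported kernel; the shim only adapts the AOTI calling convention to ExecuTorch tensors.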

Using the Voxtral runner as an example (tested on an H100):

With the bfloat16 format, here are the generated .ptd file size and latencies.

```
optimum-cli export executorch \
    --model "mistralai/Voxtral-Mini-3B-2507" \
    --task "multimodal-text-to-text" \
    --recipe "cuda" \
    --dtype bfloat16 \
    --device cuda \
    --max_seq_len 1024 \
    --output_dir ./
```

```
aoti_cuda_blob.ptd: 9.0 GB

Program load latency (ms): 0.054
Method load latency (ms):
  audio_encoder: 1492.989
  token_embedding: 803.561
  text_decoder: 6556.770
Run latency (ms):
  audio_encoder: 76.848
  token_embedding: 6.479
  text_decoder: 149.128
```

With `--qlinear 4w_hqq --qlinear_encoder 4w_hqq`, the .ptd file size is cut by more than half, but the encoder and decoder run latencies regress.

```
aoti_cuda_blob.ptd: 3.7 GB

Program load latency (ms): 0.051
Method load latency (ms):
  audio_encoder: 716.667
  token_embedding: 633.476
  text_decoder: 1840.760
Run latency (ms):
  audio_encoder: 329.274
  token_embedding: 4.285
  text_decoder: 335.590
```

Here is the result with `--qlinear 4w --qlinear_encoder 4w`, where the weights are quantized but the linear is computed as dequant + fp16 matmul. Compared with 4w_hqq, the generated file is a bit larger, but the computation is surprisingly faster; this needs more investigation.

```
aoti_cuda_blob.ptd: 5.4 GB

Program load latency (ms): 0.064
Method load latency (ms):
  audio_encoder: 872.016
  token_embedding: 663.107
  text_decoder: 3104.973
Run latency (ms):
  audio_encoder: 75.777
  token_embedding: 4.067
  text_decoder: 149.420
```

Differential Revision: D84395275

desertfire added a commit that referenced this pull request Oct 10, 2025
ghstack-source-id: a543a05
Pull Request resolved: #15030
meta-cla bot added the CLA Signed label Oct 10, 2025

pytorch-bot bot commented Oct 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15030

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

❌ 3 New Failures, 4 Unrelated Failures

As of commit 7dbdad2 with merge base afd98fe:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.


This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

```
cuda_shim_cpp_unittest("aoti_torch__reinterpret_tensor")
cuda_shim_cpp_unittest("aoti_torch_copy_")
cuda_shim_cpp_unittest("aoti_torch_cuda_guard")
cuda_shim_cpp_unittest("aoti_torch_cuda__weight_int4pack_mm")
```

desertfire (Contributor Author):

@larryliu0820, I didn't find a CMakeLists.txt for all these unit tests. I suppose we can only test them in fbcode?

mergennachin (Contributor) left a comment:

See inline for additional documentation (I used Claude Code to generate the docs).

This is great, thank you!

```
    ret0 != nullptr,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm failed: ret0 is null");
```

Contributor:

```
ET_CHECK_OR_RETURN_ERROR(
    qGroupSize == 32 || qGroupSize == 64 || qGroupSize == 128 || qGroupSize == 256,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm: qGroupSize must be 32/64/128/256, got %lld",
    static_cast<long long>(qGroupSize));
```

```
#endif

AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
    Tensor* self,
```

Contributor:

Should we check whether `self` is bfloat16?

desertfire (Contributor Author):

We already have quite a few tensor checks in the actual _weight_int4pack_mm_cuda function, so we don't have to repeat them here?


```
AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
    Tensor* self,
    Tensor* mat2,
```

Contributor:

Check whether `mat2` is int32?
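
For illustration, guards of the kind suggested in these comments could look roughly like this (the accessor and enum spellings here are assumptions, and the ported _weight_int4pack_mm_cuda already performs similar validation internally):

```
// Hypothetical dtype guards at the shim boundary (names assumed):
ET_CHECK_OR_RETURN_ERROR(
    self->scalar_type() == executorch::aten::ScalarType::BFloat16,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm: self must be bfloat16");
ET_CHECK_OR_RETURN_ERROR(
    mat2->scalar_type() == executorch::aten::ScalarType::Int,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm: mat2 must be int32 (int4-packed)");
```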

mergennachin (Contributor) commented Oct 12, 2025

Wait, why is the "Run latency" slower with int4? cc @swolchok

desertfire added a commit that referenced this pull request Oct 13, 2025
ghstack-source-id: a0c94a0
Pull Request resolved: #15030
desertfire added a commit that referenced this pull request Oct 13, 2025
ghstack-source-id: 29b5b16
Pull Request resolved: #15030

@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

desertfire (Contributor Author):

> Wait, why is the "Run latency" slower with int4? cc @swolchok

@jerryzh168, I did a quick nsys profile and found that tinygemm_m16n8k16_chunk_kernel is now the top kernel (87.4% of execution time across all CUDA kernels) and seems pretty slow (0.552 s). Is this something you have seen before?

jerryzh168 (Contributor):
> Wait, why is the "Run latency" slower with int4? cc @swolchok
>
> @jerryzh168, I did a quick nsys profile and found that tinygemm_m16n8k16_chunk_kernel is now the top kernel (87.4% of execution time across all CUDA kernels) and seems pretty slow (0.552 s). Is this something you have seen before?

This kernel is only optimized for batch size 1; is this what you are testing?

jerryzh168 (Contributor):
It depends on the hardware, I think. If it has to be A100, then it seems the only other option is the gemlite kernels, which are written in Triton.

If H100, we can integrate the fbgemm kernels; the effort would be similar to this PR, I think.

meta-codesync bot merged commit 63b8a91 into gh/desertfire/1/base Oct 14, 2025
130 of 139 checks passed
meta-codesync bot deleted the gh/desertfire/1/head branch October 14, 2025 02:14
desertfire added a commit that referenced this pull request Oct 14, 2025
This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: #15030 by @desertfire
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/desertfire/1/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/desertfire/1/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/desertfire/1/orig
Differential Revision: [D84395275](https://our.internmc.facebook.com/intern/diff/D84395275)
@diff-train-skip-merge

Co-authored-by: Bin Bao <[email protected]>

Labels: CLA Signed, topic: not user facing